
feat: Add StagehandCrawler with AI-powered browser automation#1854

Open
Mantisus wants to merge 15 commits into apify:master from Mantisus:crawlee-stagehand-crawler

Collaborator

@Mantisus Mantisus commented Apr 22, 2026

Description

Adds StagehandCrawler - a new browser crawler powered by Stagehand that lets users interact with pages using natural language instead of CSS selectors or XPath. Extends PlaywrightCrawler and inherits all of its features: routing, sessions, autoscaling, proxies, and navigation hooks.

  • StagehandPage extends Playwright Page with four AI methods: act(), extract(), observe(), and execute().
  • StagehandOptions configures the AI model, execution environment (LOCAL / BROWSERBASE), and session parameters.
  • StagehandBrowserPlugin and StagehandBrowserController integrate Stagehand into the browser pool, managing session lifecycle and fingerprint header injection.
  • Because Stagehand controls the browser launch internally and Playwright connects via CDP, only Chromium is supported, and browser configuration is limited to the subset accepted by Stagehand's BrowserLaunchOptions.
  • Added a new guide covering basic usage, AI page operations, and Browserbase integration.
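The relationship between the page and its AI session can be sketched roughly as follows. This is a minimal illustration with stand-in classes, not the code from this PR: `FakeSession`, `SketchPage`, and all method signatures are assumptions. Only the four method names (`act`, `extract`, `observe`, `execute`) come from the description above.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a Stagehand session; records calls instead of using an LLM."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    async def call(self, method: str, instruction: str) -> str:
        self.calls.append((method, instruction))
        return f'{method}: {instruction}'


class SketchPage:
    """Hypothetical page that exposes the four AI methods by delegating to the session."""

    def __init__(self, session: FakeSession) -> None:
        self._session = session

    async def act(self, instruction: str) -> str:
        return await self._session.call('act', instruction)

    async def extract(self, instruction: str) -> str:
        return await self._session.call('extract', instruction)

    async def observe(self, instruction: str) -> str:
        return await self._session.call('observe', instruction)

    async def execute(self, instruction: str) -> str:
        return await self._session.call('execute', instruction)


async def demo() -> list[tuple[str, str]]:
    session = FakeSession()
    page = SketchPage(session)
    # Natural-language instructions replace CSS selectors or XPath.
    await page.act('click the first link')
    await page.extract('get the page title')
    return session.calls


calls = asyncio.run(demo())
```

In this sketch each AI method is a thin forwarder, so the page class stays free of model-specific logic.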

Issues

Testing

  • Added unit tests for the StagehandBrowserController, StagehandBrowserPlugin, and StagehandCrawler with Stagehand mocked out - no real LLM connection required to run the test suite.
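As a rough illustration of how a test can run without an LLM connection, the sketch below replaces an AI-backed page with `unittest.mock.AsyncMock`. The helper `summarize_title` and the instruction string are hypothetical and do not come from the PR's test suite.

```python
import asyncio
from unittest.mock import AsyncMock


async def summarize_title(page) -> str:
    """Hypothetical code under test: asks the page to extract a title and normalizes it."""
    result = await page.extract('get the page title')
    return result.strip().upper()


async def run_test() -> str:
    # AsyncMock stands in for the AI-backed page, so no network or LLM is involved.
    page = AsyncMock()
    page.extract.return_value = '  example domain '
    title = await summarize_title(page)
    # Verify the method was awaited exactly once with the expected instruction.
    page.extract.assert_awaited_once_with('get the page title')
    return title


title = asyncio.run(run_test())
```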

@Mantisus Mantisus marked this pull request as ready for review May 3, 2026 22:52
@Mantisus Mantisus self-assigned this May 3, 2026
@Mantisus Mantisus requested a review from Copilot May 3, 2026 22:52
Contributor

Copilot AI left a comment

Pull request overview

Adds first-class Stagehand integration to Crawlee Python by introducing a StagehandCrawler (built on PlaywrightCrawler) plus corresponding browser-pool plugin/controller, enabling AI-driven page actions (act, extract, observe, execute) while keeping Crawlee’s existing routing/sessions/proxy/navigation-hook features.

Changes:

  • Introduces StagehandCrawler + Stagehand-specific crawling contexts and exports them from crawlee.crawlers.
  • Adds StagehandBrowserPlugin/StagehandBrowserController, StagehandOptions, and StagehandPage, integrated with BrowserPool.
  • Adds Stagehand documentation + examples, updates architecture docs, and replaces the older “Playwright with Stagehand” guide; updates dependencies and adds unit tests.

Reviewed changes

Copilot reviewed 21 out of 23 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| uv.lock | Locks new optional Stagehand dependency set and adds stagehand extra resolution entries. |
| pyproject.toml | Adds stagehand optional dependency group and includes it in all. |
| src/crawlee/browsers/__init__.py | Exposes Stagehand browser plugin/controller and types via optional imports. |
| src/crawlee/browsers/_stagehand_types.py | Defines StagehandOptions and StagehandPage AI-method wrappers. |
| src/crawlee/browsers/_stagehand_browser_plugin.py | Implements StagehandBrowserPlugin lifecycle and Stagehand client initialization. |
| src/crawlee/browsers/_stagehand_browser_controller.py | Implements CDP connection + lazy session start, page creation, and header injection for Stagehand. |
| src/crawlee/crawlers/__init__.py | Exposes Stagehand crawler + contexts via optional imports. |
| src/crawlee/crawlers/_stagehand/__init__.py | Adds Stagehand crawler module exports with optional-deps handling. |
| src/crawlee/crawlers/_stagehand/_stagehand_crawler.py | Adds StagehandCrawler built on PlaywrightCrawler and auto-configures a Stagehand BrowserPool. |
| src/crawlee/crawlers/_stagehand/_stagehand_crawling_context.py | Adds Stagehand-specific crawling context dataclasses and type-narrowed page. |
| src/crawlee/crawlers/_playwright/_playwright_crawler.py | Refactors Playwright crawler to support overridable context classes and generic context typing via _build_context. |
| tests/unit/browsers/test_stagehand_browser_plugin.py | Adds unit tests for plugin activation and Stagehand client init parameter wiring. |
| tests/unit/browsers/test_stagehand_browser_controller.py | Adds unit tests for lazy session start, concurrency behavior, proxies, and header behavior. |
| tests/unit/crawlers/_stagehand/test_stagehand_crawler.py | Adds unit tests verifying context types, hook contexts, and StagehandPage AI-method delegation. |
| docs/guides/stagehand_crawler.mdx | New guide documenting StagehandCrawler, options, AI methods, and Browserbase usage. |
| docs/guides/code_examples/stagehand_crawler/basic_example.py | Example demonstrating act() + extract() with JSON schema. |
| docs/guides/code_examples/stagehand_crawler/browserbase_example.py | Example demonstrating Browserbase environment configuration. |
| docs/guides/playwright_crawler_stagehand.mdx | Removes old guide that described manual Stagehand integration with PlaywrightCrawler. |
| docs/guides/code_examples/playwright_crawler_stagehand/support_classes.py | Removes old example support classes for the manual Stagehand integration. |
| docs/guides/code_examples/playwright_crawler_stagehand/browser_classes.py | Removes old example browser plugin/controller classes for the manual Stagehand integration. |
| docs/guides/code_examples/playwright_crawler_stagehand/stagehand_run.py | Removes old “manual integration” runnable example. |
| docs/guides/architecture_overview.mdx | Updates architecture diagrams/text to include StagehandCrawler + contexts. |


Comment thread docs/guides/stagehand_crawler.mdx Outdated
Comment thread src/crawlee/browsers/_stagehand_browser_controller.py Outdated
Comment thread src/crawlee/browsers/_stagehand_browser_controller.py
Comment thread src/crawlee/browsers/_stagehand_browser_controller.py Outdated
Comment thread src/crawlee/crawlers/_stagehand/_stagehand_crawler.py Outdated
Comment thread src/crawlee/crawlers/_stagehand/_stagehand_crawling_context.py
Comment thread src/crawlee/crawlers/_playwright/_playwright_crawler.py
Comment thread src/crawlee/crawlers/_stagehand/_stagehand_crawler.py Outdated
Collaborator Author

Mantisus commented May 3, 2026

Docs check fails due to the current versioning logic. ApiLink resolves to /api/class/<ClassName> instead of /api/next/class/<ClassName>, and since these classes don't exist in the stable API docs yet, it causes a broken link error.

@Mantisus Mantisus requested review from janbuchar and vdusek May 4, 2026 00:59
Collaborator

@vdusek vdusek left a comment

Mostly doc-related / style things. Maybe you could also align the `.rules.md` file (about the double backticks and line width for docstrings).

Comment on lines +29 to +31
extract_result = extracted.data.result

await context.push_data(cast('dict[str, str | None]', extract_result))

maybe explicit type rather than cast?

Suggested change
- extract_result = extracted.data.result
- await context.push_data(cast('dict[str, str | None]', extract_result))
+ extract_result: dict[str, str | None] = extracted.data.result
+ await context.push_data(extract_result)
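The two styles in this suggestion can be compared in a small sketch. `fetch_result` is a hypothetical stand-in for `extracted.data.result`, and its `Any` return type is an assumption. If the real attribute has a narrower, incompatible static type, a type checker would reject the annotated assignment and `cast` would still be needed.

```python
from __future__ import annotations

from typing import Any, cast


def fetch_result() -> Any:
    """Hypothetical stand-in for `extracted.data.result`, assumed here to be loosely typed."""
    return {'title': 'Example', 'subtitle': None}


# Option 1: cast() only informs the type checker; at runtime it returns its argument unchanged.
via_cast = cast('dict[str, str | None]', fetch_result())

# Option 2: an annotated assignment states the same intent without a function call.
via_annotation: dict[str, str | None] = fetch_result()
```

Neither form validates the data at runtime, so both leave the dictionary untouched.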

Comment on lines +85 to +97
Browser crawlers use a real browser to render pages, enabling scraping of sites that require
JavaScript. They manage browser instances, pages, and context lifecycles. Crawlee provides
two browser crawlers:

- <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> utilizes the
[Playwright](https://playwright.dev/) library and provides a high-level API for controlling
and navigating browsers. You can learn more about it in the
[Playwright crawler guide](./playwright-crawler).
- <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> extends
`PlaywrightCrawler` with AI-powered browser automation via
[Stagehand](https://github.com/browserbase/stagehand). It adds natural-language methods
(`act`, `extract`, `observe`, `execute`) directly on the page object. You can learn more
about it in the [Stagehand crawler guide](./stagehand-crawler).

please do not wrap lines in markdown, we don't do that, it will be wrapped by the renderer

Comment on lines +122 to +127
Because Stagehand manages the browser session internally via CDP, only Chromium is supported.
Browser settings are limited to the subset accepted by Stagehand's `BrowserLaunchOptions` -
`headless`, `args`, `viewport`, `proxy`, `locale`, `executable_path`, and a few others.
Features like full browser fingerprinting (canvas, WebGL, screen properties) and incognito
pages are not supported. Fingerprint-consistent HTTP headers (`User-Agent`, `Accept`, `sec-ch-ua`)
are still injected automatically.

Comment on lines +32 to +38
"""Controller for managing a Stagehand-controlled browser instance.

It creates and connects to the browser lazily on the first ``new_page`` call: Stagehand
starts a session, and Playwright then connects to it via CDP. All pages share a single
browser context, as Stagehand creates the browser and its context together during session
initialisation.
"""

Could you please tell Claude to:

  • use the whole 120 chars width for docstrings
  • use only single backticks for symbols -> ``new_page`` -> `new_page`

this applies to all source code files
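For illustration, the docstring quoted above could be rewrapped along these lines. The class name `StagehandBrowserControllerSketch` is a placeholder; the wording is unchanged from the quoted docstring, and only the line width and backticks differ.

```python
class StagehandBrowserControllerSketch:
    """Controller for managing a Stagehand-controlled browser instance.

    It creates and connects to the browser lazily on the first `new_page` call: Stagehand starts a session, and
    Playwright then connects to it via CDP. All pages share a single browser context, as Stagehand creates the browser
    and its context together during session initialisation.
    """
```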

initialisation.
"""

AUTOMATION_LIBRARY = 'stagehand'

Why is this not internal? Or why is it here at all? Can the Stagehand browser controller be used with another automation library?

"""

def __init__(self, page: Page, session: AsyncSession) -> None:
    super().__init__(page._impl_obj)  # noqa: SLF001

We rely on the internal Playwright page attribute? Is it necessary? If so, could you please add a comment here to explain & support it?
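One plausible reading of the constructor in question is sketched below: a public wrapper class holds a reference to an internal implementation object, and a subclass built from an existing wrapper has to pass that same object to the parent constructor. `Impl`, `Wrapper`, and `ExtendedWrapper` are stand-ins written for this sketch, not Playwright classes, and whether the PR needs the private attribute for exactly this reason is an assumption.

```python
class Impl:
    """Stand-in for an internal implementation object that holds the real page state."""

    def __init__(self, url: str) -> None:
        self.url = url


class Wrapper:
    """Stand-in for a public API class that forwards to an implementation object."""

    def __init__(self, impl_obj: Impl) -> None:
        self._impl_obj = impl_obj

    @property
    def url(self) -> str:
        return self._impl_obj.url


class ExtendedWrapper(Wrapper):
    """Subclass created from an existing wrapper, reusing its implementation object."""

    def __init__(self, page: Wrapper, session: str) -> None:
        # The parent constructor expects the implementation object, not another wrapper.
        super().__init__(page._impl_obj)
        self._session = session


original = Wrapper(Impl('https://example.com'))
extended = ExtendedWrapper(original, session='fake-session')

# Both wrappers observe the same underlying state.
original._impl_obj.url = 'https://example.com/next'
```

Under this reading, the extended page and the original page stay in sync because they share one implementation object rather than copying state.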


Development

Successfully merging this pull request may close these issues.

StagehandCrawler + Stagehand browser plugin

4 participants